Visualizing Alienation: Mapping Emotional Spaces in Frankenstein
Joost Burgers with the help of Claude Sonnet V4
This brief analysis uses sentiment analysis to understand the emotional valence of the locations and characters in Mary Shelley's Frankenstein.
Introduction & Methodology
The data was created by splitting the text of Frankenstein into volume, chapter, and paragraph units. For each paragraph, the location of the narrative present was noted. This is distinct from all the locations that are merely mentioned in the text: South America is mentioned, for example, but since the narrative never goes there it is not part of this data set. Subsequently, the RoBERTa sentiment analyzer was run on each paragraph, and the aggregate score per location was recorded.
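As a rough sketch of that per-paragraph scoring step (assuming the cardiffnlp/twitter-roberta-base-sentiment-latest checkpoint from Hugging Face, one common RoBERTa sentiment model; the model choice and sample paragraphs below are illustrative, not the project's actual pipeline):
# Illustrative sketch of per-paragraph RoBERTa scoring; the model choice is an assumption
from transformers import pipeline
import pandas as pd

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

paragraphs = pd.DataFrame({
    "curated_name": ["Geneva", "Ingolstadt"],  # narrative-present location per paragraph
    "paragraph_text": [
        "I am by birth a Genevese, and my family is one of the most distinguished of that republic.",
        "It was on a dreary night of November that I beheld the accomplishment of my toils.",
    ],
})

# Score each paragraph (truncating anything over the model's 512-token limit),
# then group the labels by location for a per-location aggregate
results = sentiment(paragraphs["paragraph_text"].tolist(), truncation=True)
paragraphs["label"] = [r["label"] for r in results]
paragraphs["confidence"] = [r["score"] for r in results]
print(paragraphs.groupby("curated_name")["label"].agg(list))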
A follow-up analysis was performed on the sentiments surrounding each character. This is less accurate, as the calculation relies purely on a character's name as an indicator of that character's presence. No attempt was made to resolve pronouns to characters; thus "I" could be Walton, Victor, or the Monster. Without significantly more supervised work, these distinctions cannot be recovered from the data.
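A minimal sketch of that name-matching logic (the character list and helper below are illustrative, not the code used to build the dataset):
# Illustrative name-based presence check; pronouns are invisible to it by design
import re

CHARACTERS = ["Victor", "Elizabeth", "Henry", "Justine", "Felix", "Agatha"]

def characters_mentioned(paragraph):
    """Return the characters whose names appear as whole words in a paragraph."""
    return [name for name in CHARACTERS if re.search(rf"\b{name}\b", paragraph)]

print(characters_mentioned("Elizabeth wept, and he could not console her."))
# ['Elizabeth'] - the "he" (perhaps Victor) goes uncounted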
# Load analysis results from parquet files
# This cell is hidden from HTML export
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py
# Configure Plotly for HTML export - this ensures charts work in exported HTML
py.init_notebook_mode(connected=False) # Use offline mode for HTML export
import plotly.io as pio
pio.renderers.default = "notebook" # Ensure charts render in notebook and HTML
try:
# Load all datasets from parquet files (fast and efficient)
frankenstein_all_with_sentiment = pd.read_parquet("frankenstein_all_paragraphs_with_sentiment.parquet")
frankenstein_all_with_sentiment.to_csv("frankenstein_all_paragraphs_with_sentiment.csv")
character_sentiment_df = pd.read_parquet("frankenstein_character_sentiment.parquet")
location_sentiment_summary = pd.read_parquet("frankenstein_location_sentiment.parquet")
frankenstein_manual_locations = pd.read_parquet("frankenstein_manual_locations.parquet")
# Set up coordinate columns
coords_columns = list(frankenstein_manual_locations.columns[-2:])
lat_col = coords_columns[0]
lon_col = coords_columns[1]
except FileNotFoundError as e:
print(f"β Error loading data: {e}")
Part I: The Geographic Imagination of Frankenstein
The text covers a remarkably large geographic canvas given its relative brevity. Some of this is no doubt because Mary Shelley traveled quite a bit while composing the text.
Geographic Distribution and Narrative Weight
The map below shows all the locations to which the text travels. The circle sizes represent the total word count associated with each location. Note that this does not necessarily indicate how long the text stays in a location in terms of narrative duration: Victor is in Ingolstadt for quite some time, for example, but the relative amount of text set there is quite small.
Interactive Geographic Map
# Geographic Distribution Map - Clean Version
try:
# Calculate location counts for sizing
valid_coords = frankenstein_manual_locations[
(frankenstein_manual_locations[lat_col].notna()) &
(frankenstein_manual_locations[lon_col].notna())
].copy()
valid_coords['word_count'] = valid_coords['paragraph_text'].str.split().str.len()
total_narrative_words = frankenstein_manual_locations['paragraph_text'].str.split().str.len().sum()
# Clean location names to handle duplicates like "Delacey Cottage"
valid_coords['curated_name_clean'] = valid_coords['curated_name'].str.strip()
# Group locations that are essentially the same (like multiple "Delacey Cottage" entries)
# First aggregate by cleaned name to get representative coordinates
location_coords = valid_coords.groupby('curated_name_clean').agg({
lat_col: 'first', # Use first occurrence coordinates
lon_col: 'first'
}).reset_index()
# Then sum word counts by cleaned name
location_counts = valid_coords.groupby('curated_name_clean').agg({
'word_count': 'sum'
}).reset_index()
# Merge coordinates back
location_counts = location_counts.merge(location_coords, on='curated_name_clean')
location_counts = location_counts.rename(columns={'curated_name_clean': 'curated_name'})
location_counts = location_counts.rename(columns={'word_count': 'total_words'})
location_counts['narrative_percent'] = (location_counts['total_words'] / total_narrative_words * 100).round(2)
# Create the geographic map
fig_geo = px.scatter_map(
location_counts,
lat=lat_col,
lon=lon_col,
hover_name="curated_name",
size="total_words",
size_max=40,
hover_data={
"narrative_percent": ":.2f",
"total_words": True,
lat_col: False,
lon_col: False
},
title="Geographic Locations in Frankenstein: Narrative Distribution",
labels={"narrative_percent": "% of Total Narrative"},
zoom=3,
height=700,
color_discrete_sequence=['#2E86AB']
)
fig_geo.update_layout(
        map_style="carto-positron",  # scatter_map figures use layout.map, not layout.mapbox
margin={"r":0,"t":50,"l":0,"b":0}
)
# Configure for HTML export - embed the plot with full JavaScript
fig_geo.update_layout(
font=dict(size=12),
title_font=dict(size=16),
)
# Note: scatter_map doesn't support marker line outlines
# The circles will be solid without outlines for this map type
# Show with offline configuration for HTML export
py.iplot(fig_geo, show_link=False, config={'displayModeBar': True})
# Display insights
print(f"π Most significant locations by word count:")
top_locations = location_counts.nlargest(5, 'total_words')
for _, row in top_locations.iterrows():
print(f" {row['curated_name']}: {row['narrative_percent']:.1f}% ({row['total_words']} words)")
except NameError:
print("β οΈ Data not loaded - please run the data loading cell first")
Most significant locations by word count:
  Geneva: 24.2% (17387 words)
  Delacey Cottage: 14.1% (10135 words)
  Ingolstadt: 11.9% (8543 words)
  Artic: 10.6% (7583 words)
  Montanvert: 5.8% (4175 words)
Emotional Geography Analysis
Each of the locations above can also be analyzed for its emotional valence using sentiment analysis. This is an imperfect process. A sentiment analyzer uses a machine learning model to decide whether the words in a sentence are "positive", "negative", or "neutral". To learn how to do this, the model has to be "trained" on real human language. Most sentiment analyzers have been trained on online product reviews or social media posts, which are necessarily short and often emotionally charged. Emotional states in books are far more complex, so the results should be taken with a grain of salt.
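Concretely, the RoBERTa model returns three probabilities per paragraph (negative, neutral, positive). These can be collapsed into the single score plotted below in several ways; one convention, and the fallback the animation code later in this notebook uses, is the positive probability minus the negative one (toy values for illustration):
# Collapse the three class probabilities into one score in [-1, 1]; 0 is neutral
import pandas as pd

probs = pd.DataFrame({
    "roberta_neg": [0.70, 0.05],  # toy values for illustration
    "roberta_neu": [0.25, 0.15],
    "roberta_pos": [0.05, 0.80],
})
probs["sentiment_score"] = probs["roberta_pos"] - probs["roberta_neg"]
print(probs["sentiment_score"].tolist())  # [-0.65, 0.75]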
# Debug and fix Unicode issues in location data
import re
def clean_unicode_surrogates(text):
"""Remove problematic Unicode surrogate characters"""
if isinstance(text, str):
# Remove surrogate characters (U+D800 to U+DFFF)
return re.sub(r'[\uD800-\uDFFF]', '', text)
return text
def clean_dataframe_unicode(df):
"""Clean Unicode issues in all string columns of a DataFrame"""
df_cleaned = df.copy()
for column in df_cleaned.columns:
if df_cleaned[column].dtype == 'object':
df_cleaned[column] = df_cleaned[column].apply(clean_unicode_surrogates)
return df_cleaned
try:
# Check for problematic characters in location names
problematic_locations = []
for idx, row in location_sentiment_summary.iterrows():
try:
# Try to encode the location name
row['curated_name'].encode('utf-8')
except UnicodeEncodeError as e:
problematic_locations.append((idx, row['curated_name'], str(e)))
# Clean the location data
location_sentiment_summary_cleaned = clean_dataframe_unicode(location_sentiment_summary)
# Verify cleaning worked
# Update the global variable
location_sentiment_summary = location_sentiment_summary_cleaned
except Exception as e:
print(f"β Error during cleaning: {e}")
print(f"Error type: {type(e)}")
# Emotional Geography Map - Sentiment Analysis (Unicode Safe)
import re
def clean_text_for_display(text):
"""Clean text for safe display, removing problematic Unicode characters"""
if pd.isna(text) or not isinstance(text, str):
return str(text)
# Remove surrogate pairs and other problematic characters
text = re.sub(r'[\uD800-\uDFFF]', '', text) # Remove surrogates
text = re.sub(r'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F]', '', text) # Remove control characters
return text
try:
# Clean all text data for safe display and handle duplicate locations
location_data_safe = location_sentiment_summary.copy()
location_data_safe['curated_name'] = location_data_safe['curated_name'].apply(clean_text_for_display)
location_data_safe['sentiment_category'] = location_data_safe['sentiment_category'].apply(clean_text_for_display)
# Clean location names to handle duplicates like "Delacey Cottage"
location_data_safe['curated_name_clean'] = location_data_safe['curated_name'].str.strip()
# Aggregate duplicate locations by summing word counts and averaging sentiment
location_data_safe = location_data_safe.groupby('curated_name_clean').agg({
'lat': 'first',
'long': 'first',
'total_words': 'sum',
'avg_sentiment': 'mean',
'narrative_percent': 'sum',
'sentiment_category': lambda x: x.mode().iloc[0] if not x.empty else x.iloc[0]
}).reset_index()
location_data_safe = location_data_safe.rename(columns={'curated_name_clean': 'curated_name'})
# Create sentiment-enhanced map using cleaned data
fig_sentiment = px.scatter_map(
location_data_safe,
lat='lat',
lon='long',
hover_name='curated_name',
size="total_words",
size_max=35,
color="avg_sentiment",
color_continuous_scale='RdYlGn',
color_continuous_midpoint=0,
hover_data={
"narrative_percent": ":.2f",
"avg_sentiment": ":.3f",
"sentiment_category": True,
'lat': False,
'long': False
},
title="Emotional Geography of Frankenstein: Location Sentiment Analysis",
labels={
"avg_sentiment": "Average Sentiment",
"narrative_percent": "% of Total Narrative"
},
zoom=3,
height=700
)
fig_sentiment.update_layout(
        map_style="carto-positron",  # scatter_map figures use layout.map, not layout.mapbox
margin={"r":0,"t":50,"l":0,"b":0},
coloraxis_colorbar=dict(
title="Sentiment Score",
tickvals=[-0.4, -0.2, 0, 0.2, 0.4],
ticktext=["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
)
)
# Configure for HTML export
fig_sentiment.update_layout(
font=dict(size=12),
title_font=dict(size=16),
)
# Note: scatter_map doesn't support marker line outlines
# The circles will use their sentiment colors without outlines
# Show with offline configuration for HTML export
py.iplot(fig_sentiment, show_link=False, config={'displayModeBar': True})
# Display sentiment insights with cleaned text
avg_overall_sentiment = location_data_safe['avg_sentiment'].mean()
# Show most positive and negative locations with cleaned names
most_positive = location_data_safe.nlargest(3, 'avg_sentiment')[['curated_name', 'avg_sentiment']]
most_negative = location_data_safe.nsmallest(3, 'avg_sentiment')[['curated_name', 'avg_sentiment']]
print(f"\n⨠Most positively framed locations:")
for _, row in most_positive.iterrows():
clean_name = clean_text_for_display(row['curated_name'])
print(f" {clean_name}: {row['avg_sentiment']:.3f}")
print(f"\nβοΈ Most negatively framed locations:")
for _, row in most_negative.iterrows():
clean_name = clean_text_for_display(row['curated_name'])
print(f" {clean_name}: {row['avg_sentiment']:.3f}")
except NameError:
    print("Sentiment data not available - please run the data loading cell first")
except Exception as e:
    print(f"Error creating sentiment map: {e}")
    print(f"Error type: {type(e).__name__}")
    # Fallback: show basic sentiment statistics without the map
    try:
        avg_overall_sentiment = location_sentiment_summary['avg_sentiment'].mean()
        print("\nBasic Sentiment Statistics:")
        print(f"Overall sentiment across all locations: {avg_overall_sentiment:.3f}")
        sentiment_distribution = location_sentiment_summary['sentiment_category'].value_counts()
        print(f"Sentiment distribution: {sentiment_distribution.to_dict()}")
    except Exception as fallback_error:
        print(f"Even fallback statistics failed: {fallback_error}")
Most positively framed locations:
  Windsor: 0.477
  Edinburgh: 0.344
  Chamonix: 0.333

Most negatively framed locations:
  Holyhead: -0.410
  Zurich: -0.238
  Beach somewhere on the Irish Coast: -0.231
Animating Movements Through Time
We can also track where these emotions occur in time by splitting the data further. By giving each stop at a location a chronological number, we can get the sentiment for that particular event at that particular location, rather than the average sentiment per location. After all, sometimes Victor is happy in Geneva and sometimes he is sad.
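One way to derive such a chronological number (a sketch of the idea; the ordinal column in the CSV below was prepared upstream) is to start a new step whenever the narrative's location changes, so that a return visit counts as a separate event:
# Sketch: assign an ordinal per contiguous stay at a location
import pandas as pd

df = pd.DataFrame({"curated_name": ["Geneva", "Geneva", "Ingolstadt", "Geneva"]})
# A new ordinal starts whenever the location differs from the previous paragraph
df["ordinal"] = (df["curated_name"] != df["curated_name"].shift()).cumsum()
print(df["ordinal"].tolist())  # [1, 1, 2, 3]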
Chronological Emotion Animation
frankenstein_emotion_sequence_df = pd.read_csv("frankenstein_paragraphs_geoparsed_located_chrono.csv")
# Work with chronological data
df = frankenstein_emotion_sequence_df.copy()
df['word_count'] = df['paragraph_text'].str.split().str.len()
# Check what sentiment columns are available
# Use compound score if available, otherwise calculate from pos/neg.
# Note: the .values assignments below assume both dataframes list paragraphs in the same order.
if 'roberta_compound' in frankenstein_all_with_sentiment.columns:
sentiment_col = 'roberta_compound'
df['sentiment_score'] = frankenstein_all_with_sentiment[sentiment_col].values
elif 'roberta_pos' in frankenstein_all_with_sentiment.columns and 'roberta_neg' in frankenstein_all_with_sentiment.columns:
df['sentiment_score'] = (frankenstein_all_with_sentiment['roberta_pos'].values -
frankenstein_all_with_sentiment['roberta_neg'].values)
else:
    # Fall back to whichever RoBERTa score column is present
    sentiment_cols = [c for c in frankenstein_all_with_sentiment.columns if c.startswith('roberta')]
    print("Using first available sentiment column")
    sentiment_col = sentiment_cols[0] if sentiment_cols else 'roberta_neg'
    df['sentiment_score'] = frankenstein_all_with_sentiment[sentiment_col].values
# Clean and prepare data
df_clean = df.dropna(subset=['lat', 'long']).copy()
# Clean location names to handle duplicates like "Delacey Cottage"
df_clean['curated_name_clean'] = df_clean['curated_name'].str.strip()
# Aggregate by ordinal (chronological step) and cleaned location name
animation_data = df_clean.groupby(['ordinal', 'curated_name_clean', 'lat', 'long']).agg({
'word_count': 'sum', # Total words for this location at this time
'sentiment_score': 'mean', # Average sentiment
'text_section': 'first',
'chapter_letter': 'first'
}).reset_index()
# Rename back to curated_name for display
animation_data = animation_data.rename(columns={'curated_name_clean': 'curated_name'})
# Add frame information
animation_data['frame_label'] = animation_data['ordinal'].astype(str)
animation_data['chapter_info'] = (animation_data['text_section'].str.replace('_', ' ') +
' ' + animation_data['chapter_letter']).str.title()
# Add sentiment category for hover info
animation_data['sentiment_category'] = animation_data['sentiment_score'].apply(
lambda x: 'Positive' if x > 0.1 else ('Negative' if x < -0.1 else 'Neutral')
)
# Create the animated map
fig_animated = px.scatter_map(
animation_data,
lat="lat",
lon="long",
hover_name="curated_name",
size="word_count", # Size = word count (as requested)
size_max=60,
color="sentiment_score", # Color = sentiment
color_continuous_scale='RdYlGn',
color_continuous_midpoint=0,
animation_frame="frame_label",
hover_data={
"sentiment_score": ":.3f",
"sentiment_category": True,
"word_count": True,
"chapter_info": True,
"ordinal": True
},
title="Emotional Journey Through Frankenstein: Chronological Animation<br><sub>Size = Word Count | Color = RoBERTa Sentiment Score</sub>",
labels={
"sentiment_score": "Sentiment Score",
"sentiment_category": "Sentiment",
"word_count": "Words",
"chapter_info": "Chapter",
"ordinal": "Chronological Step"
},
zoom=3,
height=850
)
# Style the map
fig_animated.update_layout(
    map_style="carto-positron",  # scatter_map figures use layout.map, not layout.mapbox
margin={"r":0,"t":90,"l":0,"b":0},
coloraxis_colorbar=dict(
title="Sentiment Score",
tickvals=[-0.6, -0.3, 0, 0.3, 0.6],
ticktext=["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
)
)
# Configure for HTML export
fig_animated.update_layout(
font=dict(size=12),
title_font=dict(size=16),
)
# Animation settings - slower for better observation (durations are in milliseconds)
fig_animated.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 1000 # 1 second per frame
fig_animated.layout.updatemenus[0].buttons[0].args[1]['transition']['duration'] = .01 # effectively instant transition
# Add step info to frames with more context
for i, frame in enumerate(fig_animated.frames):
step = i + 1
current_step_data = animation_data[animation_data['ordinal'] == step]
if not current_step_data.empty:
# Get the chapter info and locations for this step
chapter_info = current_step_data['chapter_info'].iloc[0]
locations = current_step_data['curated_name'].tolist()
location_text = ", ".join(locations) if len(locations) <= 3 else f"{', '.join(locations[:2])}, +{len(locations)-2} more"
frame.layout.title = f"{chapter_info} - {location_text}"
else:
frame.layout.title = f"Frankenstein Journey - Step {step}/{len(fig_animated.frames)}"
# Show with offline configuration for HTML export
py.iplot(fig_animated, show_link=False, config={'displayModeBar': True})
Part II: Character Sentiment Analysis
We can also try to extract the emotions around each character. This is an even murkier process. Because the locations were all manually identified, we can be certain that each paragraph is set at that particular location. With characters, it is hard to know whether they are actually present in a particular paragraph unless they are manually coded as being there. For example, Victor may mention his father in a paragraph even though Alphonse is not actually present. Furthermore, paragraphs may simply use pronouns (he/she) to refer to a character. What is measured, then, is the sentiment of paragraphs in which a specific character is explicitly named.
Character Emotional Framing
# Character Sentiment Analysis Visualizations
try:
# Sort by sentiment for better visualization
character_df_sorted = character_sentiment_df.sort_values('Avg_Sentiment', ascending=False)
# 1. Character Sentiment Overview Bar Chart
fig1 = px.bar(
character_df_sorted,
x='Character',
y='Avg_Sentiment',
color='Avg_Sentiment',
color_continuous_scale='RdYlGn',
color_continuous_midpoint=0,
title='Character Emotional Framing: Average Sentiment by Character',
labels={'Avg_Sentiment': 'Average Sentiment Score'},
hover_data=['Total_Mentions', 'Total_Words']
)
fig1.add_hline(y=0, line_dash="dash", line_color="gray",
annotation_text="Neutral Baseline", annotation_position="top right")
fig1.update_layout(
height=500,
xaxis_title="Character",
yaxis_title="Average Sentiment Score",
showlegend=False
)
# Show with offline configuration for HTML export
py.iplot(fig1, show_link=False, config={'displayModeBar': True})
# 2. Sentiment Distribution Stack Chart
fig2 = go.Figure()
fig2.add_trace(go.Bar(
name='Positive Mentions',
x=character_df_sorted['Character'],
y=character_df_sorted['Positive_Mentions'],
marker_color='#2E8B57',
opacity=0.8
))
fig2.add_trace(go.Bar(
name='Neutral Mentions',
x=character_df_sorted['Character'],
y=character_df_sorted['Neutral_Mentions'],
marker_color='#708090',
opacity=0.8
))
fig2.add_trace(go.Bar(
name='Negative Mentions',
x=character_df_sorted['Character'],
y=character_df_sorted['Negative_Mentions'],
marker_color='#CD5C5C',
opacity=0.8
))
fig2.update_layout(
barmode='stack',
title='Character Emotional Complexity: Sentiment Distribution by Character',
xaxis_title='Character',
yaxis_title='Number of Paragraphs',
height=500
)
# Show with offline configuration for HTML export
py.iplot(fig2, show_link=False, config={'displayModeBar': True})
# 3. Character Frequency vs Sentiment Scatter Plot
fig3 = px.scatter(
character_df_sorted,
x='Total_Mentions',
y='Avg_Sentiment',
size='Total_Words',
color='Avg_Sentiment',
color_continuous_scale='RdYlGn',
color_continuous_midpoint=0,
hover_name='Character',
title='Character Analysis: Narrative Presence vs Emotional Framing',
labels={
'Total_Mentions': 'Number of Paragraph Mentions',
'Avg_Sentiment': 'Average Sentiment Score',
'Total_Words': 'Total Words in Context'
}
)
# Add reference lines
fig3.add_hline(y=0, line_dash="dash", line_color="gray", opacity=0.5)
fig3.add_vline(x=character_df_sorted['Total_Mentions'].median(),
line_dash="dash", line_color="gray", opacity=0.5)
fig3.update_layout(height=500)
# Show with offline configuration for HTML export
py.iplot(fig3, show_link=False, config={'displayModeBar': True})
# Display key insights
most_positive = character_df_sorted.iloc[0]
most_negative = character_df_sorted.iloc[-1]
most_mentioned = character_df_sorted.loc[character_df_sorted['Total_Mentions'].idxmax()]
print("π Character Analysis - Key Findings:")
print(f"β¨ Most positively portrayed: {most_positive['Character']} (sentiment: {most_positive['Avg_Sentiment']:.3f})")
print(f"βοΈ Most negatively portrayed: {most_negative['Character']} (sentiment: {most_negative['Avg_Sentiment']:.3f})")
print(f"π Most frequently mentioned: {most_mentioned['Character']} ({most_mentioned['Total_Mentions']} paragraphs)")
print(f"π Characters analyzed: {len(character_df_sorted)}")
print(f"\nπ Character Emotional Patterns:")
for _, row in character_df_sorted.iterrows():
pos_pct = (row['Positive_Mentions'] / row['Total_Mentions'] * 100)
neg_pct = (row['Negative_Mentions'] / row['Total_Mentions'] * 100)
neu_pct = (row['Neutral_Mentions'] / row['Total_Mentions'] * 100)
print(f" {row['Character']:>10}: {pos_pct:5.1f}% pos, {neu_pct:5.1f}% neu, {neg_pct:5.1f}% neg (avg: {row['Avg_Sentiment']:6.3f})")
except NameError:
print("β οΈ Character analysis data not available - please run the data loading cell first")
Character Analysis - Key Findings:
Most positively portrayed: Agatha (sentiment: 0.107)
Most negatively portrayed: Justine (sentiment: -0.181)
Most frequently mentioned: Father (125 paragraphs)
Characters analyzed: 11

Character Emotional Patterns:
Agatha: 58.8% pos, 23.5% neu, 17.6% neg (avg: 0.107)
Ernest: 42.9% pos, 35.7% neu, 21.4% neg (avg: 0.023)
Elizabeth: 33.8% pos, 28.7% neu, 37.5% neg (avg: 0.018)
Felix: 36.8% pos, 36.8% neu, 26.3% neg (avg: -0.001)
Henry: 32.3% pos, 27.7% neu, 40.0% neg (avg: -0.008)
Father: 28.0% pos, 36.8% neu, 35.2% neg (avg: -0.012)
Mother: 16.0% pos, 40.0% neu, 44.0% neg (avg: -0.032)
Victor: 15.1% pos, 41.5% neu, 43.4% neg (avg: -0.127)
William: 16.7% pos, 25.0% neu, 58.3% neg (avg: -0.130)
Monster: 16.5% pos, 25.7% neu, 57.8% neg (avg: -0.136)
Justine: 12.5% pos, 12.5% neu, 75.0% neg (avg: -0.181)
Character Sentiment Insights
Perhaps the most interesting contrast here is that between Agatha and Justine. It can perhaps be explained by the fact that neither character appears in the text very often, so the emotional states surrounding them are relatively isolated. Conversely, Henry and Elizabeth are mentioned more diffusely throughout the text and are more likely to average out to "neutral" because both positive and negative emotions surround them. By the same logic, we would expect both positive and negative emotions to surround Victor, yet these are overwhelmingly negative. Again, this captures only paragraphs where Victor is named, so the tokenizer won't pick up lines where Victor simply says "I", "me", or "we", which is nearly all the time.
Perhaps the most interesting contrast here is that between Agatha and Justine. The contrast can perhaps be explained by the fact that they do not appear in the text too often, therefore the emotional states surrounding them are relatively isolated. Conversely, Henry and Elizabeth are more diffuse terms throughout the text and would be more likely to be "neutral" due to positive and negative emotions surrounding them. By this same logic, we would positive and negative emotions to surround Victor as well, but these are overwhelmingly negative. This again is only when Victor is mentioned so the tokenizer won't pick up on lines where Victor simply says "I", "me", or "we", which is all the time.